MPI/FT: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing
Abstract
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware; these include space-based systems, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements and constraints. MPI codes are evolved to use MPI/FT features, and non-portable code for event handlers and recovery management is isolated. User-coordinated recovery, checkpointing, transparency, and event handling, as well as evolvability of legacy MPI codes, form key design criteria. Parallel self-checking threads address four levels of MPI implementation robustness, three of which are portable to any multithreaded MPI. A taxonomy of application types provides six initial fault-relevant models; user-transparent parallel nMR computation is thereby considered. Key concepts from MPI/RT (real-time MPI) are also incorporated into MPI/FT, with further overt support for MPI/RT and MPI/FT in applications possible in the future.
Similar resources
MPI/FT™: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements...
Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters
We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient MPI programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI programs instantiate a specially writ...
Fault Tolerant Communication Library and Applications for High Performance Computing
With increasing numbers of processors on today's machines, the probability for node or link failures is also increasing. Therefore, application level fault-tolerance is becoming more of an important issue for both end-users and the institutions running the machines. This paper presents the semantics of a fault tolerant version of the Message Passing Interface, the de-facto standard for communica...
FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems
Ever-increasing demands of space missions for data returns from their limited processing and communications resources have made the traditional approach of data gathering, data compression, and data transmission no longer viable. Increasing on-board processing power by providing high-performance computing (HPC) capabilities using commercial-off-the-shelf (COTS) components is a promising approac...
A Fault-Tolerant Communication Library for Grid Environments
With increasing numbers of processors and applications running in virtual Grid environments, application level fault-tolerance is getting more of an important issue. This paper presents the semantics of a fault tolerant version of the Message Passing Interface, the de-facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or ...